Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Canonicalize API #8983

Merged
merged 8 commits into from
Feb 1, 2024
Merged

Improve Canonicalize API #8983

merged 8 commits into from
Feb 1, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jan 24, 2024

Which issue does this PR close?

Part of #8724
Follow on to #8780

Rationale for this change

The new canonicalize API added in #8780:

  1. Has to be explictly called (so existing users of the ExprSimplifer will not get its benefits by default)
  2. Has a diferent API than the existing ExprSimplifer::with_guarantees

What changes are included in this PR?

  1. Add Simplifier::with_canonicalize and update code to use it
  2. Add documentation and example

Are these changes tested?

Yes, with new doc example as well as existing code

Are there any user-facing changes?

Better api + documentation.

cc @Jefffrey

@github-actions github-actions bot added the optimizer Optimizer rules label Jan 24, 2024
///
/// assert_eq!(non_canonicalized, expr);
/// ```
pub fn with_canonicalize(mut self, canonicalize: bool) -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the new api that is part of the ExprSimplifer that mirrors the guarantees

@@ -98,6 +96,7 @@ impl SimplifyExpressions {
// The left and right expressions in a Join on clause are not commutative,
// since the order of the columns must match the order of the children.
LogicalPlan::Join(_) => {
let simplifier = ExprSimplifier::new(info).with_canonicalize(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about moving the initialization of simplifier out like below? It looks cleaner. :)

        let simplifier = match plan {
            // Canonicalize step won't reorder expressions in a Join on clause.
            // The left and right expressions in a Join on clause are not commutative,
            // since the order of the columns must match the order of the children.
            LogicalPlan::Join(_) => ExprSimplifier::new(info).with_canonicalize(false),
            _ => ExprSimplifier::new(info),
        };
        let expr = plan
            .expressions()
            .into_iter()
            .map(|e| {
                // TODO: unify with `rewrite_preserving_name`
                let original_name = e.name_for_alias()?;
                let new_e = simplifier.simplify(e)?;
                new_e.alias_if_changed(original_name)
            })
            .collect::<Result<Vec<_>>>()?;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍 I love it. I did it in e7095cd and it looks much nicer now. Thank you @yyy1000

Copy link
Contributor

@yyy1000 yyy1000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good design! 😃
I helped review it and here are some of my thoughts.

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks clean 👍

Comment on lines +271 to +278
/// // Expression: a = c AND 1 = b
/// let expr = col("a").eq(col("c")).and(lit(1).eq(col("b")));
///
/// // With canonicalization, the expression is rewritten to canonical form
/// // (though it is no simpler in this case):
/// let canonical = simplifier.simplify(expr.clone()).unwrap();
/// // Expression has been rewritten to: (c = a AND b = 1)
/// assert_eq!(canonical, col("c").eq(col("a")).and(col("b").eq(lit(1))));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, seeing this just made me realize the doc for Canonicalize struct actually states the opposite order:

https://github.com/apache/arrow-datafusion/blob/94a6192f6be30b7f6d009bc936a866bf5dcb280c/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L239

Probably can fix as part of this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, that's my previous fault.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 1bb7d09


// The left and right expressions in a Join on clause are not commutative,
// since the order of the columns must match the order of the children.
// Thus, not reorder expressions in a Join `on` clause while simplifying
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I just took a look at the join node:

https://github.com/apache/arrow-datafusion/blob/94a6192f6be30b7f6d009bc936a866bf5dcb280c/datafusion/expr/src/logical_plan/plan.rs#L2596-L2599

Was it actually a problem with the filter clause and not the on clause? 🤔

Because the on here is a vec of tuples to represent left and right (I was assuming it was a single Expr before)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's the problem with on. 🤔
In my previous experience, it's a problem when the tuples of on keeps as before when optimizing, but the left and right of Join change, so they can't match.
Not pretty sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is not entirely clear what is wrong with canonicalizing the expressions for joins. I'll add a note to the comment explaining that we don't really understand the issue (maybe someone can look into it as a follow on)

@alamb alamb requested a review from yyy1000 January 25, 2024 13:00
Copy link
Contributor

@yyy1000 yyy1000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@alamb
Copy link
Contributor Author

alamb commented Jan 27, 2024

Looks great!

Thanks for the review @yyy1000 -- I am now just waiting for another committer to approve the PR and I'll merge it

@alamb alamb merged commit e566329 into apache:main Feb 1, 2024
22 checks passed
@alamb
Copy link
Contributor Author

alamb commented Feb 1, 2024

@Jefffrey is now a committer (so thank you for your review @Jefffrey 🙏 )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants